1 Introduction

According to the World Health Organization (WHO), health is “a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity.” The United States has just gone through the COVID-19 pandemic and ranks 19th in the world in COVID-19 vaccination rates. Before this, the USA faced many infamous outbreaks, such as scarlet fever, typhoid fever (Typhoid Mary), H1N1 flu and measles. School vaccination requirements in the United States date back to the 1850s, and today all 50 states require immunizations for children in order to enroll in public school. Climate change has been affecting the United States by exacerbating existing health threats and creating new challenges for the healthcare community. Air pollution, wildfires, food- and waterborne diseases, and mental health crises are all observable effects of climate change.

Case surveillance is foundational to public health practice. It helps us to understand diseases and their spread and determine appropriate actions to control outbreaks. Case surveillance occurs each time public health agencies at the local, state, or national levels collect information about a case or person diagnosed with a disease or condition that poses a serious health threat to people. CDC (Centers for Disease Control and Prevention) conducts case surveillance through the National Notifiable Diseases Surveillance System (NNDSS). In the case surveillance process, about 3,000 health departments gather and use data on disease cases to protect their local communities. Through NNDSS, CDC receives and uses these data to keep people healthy and defend America from health threats.

The US weekly Nationally Notifiable Disease Surveillance Data from 1888 to 2014 contains information on 50 diseases reported by 50 US states and 1284 US cities. The data can be used to estimate seasonal and long-term transmission trends, build models for predicting infectious disease outbreaks, and support scientific research.

In this project, we analyse this complete dataset. We focus on the prevalence trends of various diseases, their effect on the individual US states, and the distribution of casualties. To achieve this aim, we use descriptive statistics and graphical methods to explore the data and identify patterns and trends. We present our findings and visualizations in the next sections.

2 Problem definition

The available dataset covers diseases in the United States from 1888 to 2014, with the state, location and time of each report. Specifically, we focus on different aspects of disease prevalence. By creating graphs and charts, we seek to present a clear and intuitive depiction of the relative abundance of different diseases over time. The outcome of this analysis will provide valuable insights into the most significant diseases based on their prevalence. This information can be used to prioritize public health interventions, allocate resources effectively, and inform decision-making processes aimed at preventing, controlling, and managing these diseases.

3 Objectives

The objective of this analysis is to identify and rank the most significant diseases based on their prevalence over a specific time period. The analysis aims to provide a comprehensive understanding of the diseases that have had the highest impact on public health, considering factors such as frequency, severity, and long-term consequences.

Specifically, the objectives are as follows:

  1. Explore and preprocess the disease dataset to ensure data quality and compatibility for analysis.
  2. Calculate the prevalence of each disease by determining the frequency of occurrence within the dataset.
  3. Calculate the morbidity rate for each disease by considering the reported cases in relation to the population (census 1910-2000) at risk.
  4. Analyze the top disease prevalence on a state-by-state basis to identify regional variations.
  5. Provide clear and intuitive visualizations of the most prevalent diseases in the USA across states and over time.
  6. Find the prevalence of the three most prevalent diseases across the USA.
  7. Examine the distribution of deaths and cases for these diseases.

By achieving these objectives, this analysis aims to contribute to the understanding of disease prevalence and prioritize the prevention, control, and management of the most impactful diseases, ultimately leading to more effective public health strategies and interventions.

4 Methods

The R programming language provides us with a variety of convenient tools that allow us to transform our data into visually appealing and informative graphs. These graphs play a crucial role in helping us understand and interpret the data more effectively. Throughout our analysis, we have utilized several types of graphs to visualize the information, making it easier to comprehend.

In addition, we use the morbidity rate as a measure of the severity of diseases.

The morbidity rate measures the portion of people in a specific geographical location who contracted a particular disease during a specific period of time. It indicates how frequently the disease appears in a population. The formula is:

\[ Morbidity\ Rate\ Percentage = \frac{Total\ Number\ of\ Cases\ of\ Disease}{Total\ Population} \times 100 \]
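As a quick illustration of the formula, the following minimal sketch computes the morbidity rate for made-up numbers. The case count and population are hypothetical and are not taken from the dataset.

```r
# Hypothetical figures for illustration only (not from the dataset)
total_cases      <- 250000     # reported cases of one disease in one decade
total_population <- 100000000  # census population for that decade

# Morbidity rate as a percentage of the population
morbidity_rate <- total_cases / total_population * 100
morbidity_rate
```

With these numbers, about 0.25% of the population would have been reported with the disease.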

5 Data Analysis

5.1 Required Libraries

The following libraries are used in this project:

library(tidyverse)
library(readr)
library(config)
library(data.table)
library(treemap)
library(treemapify)
library(plotly)
library(gganimate)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(ggthemes)
library(ggridges)
library(stringr)
library(scales)
library(tibble)
library(tidyr)
library(magrittr)
library(forcats)

5.2 Data Pre-processing

5.2.1 Loading Data

The code below loads the data. We created a project environment and working directory for this work, and all required data files are stored in that directory.

setwd("A:/DV23project/Project_Tycho") #Work directory
MainData<-read.csv("ProjectTycho_Level2_v1.1.0.csv", header=TRUE) #loading csv dataset with headers

5.2.2 Cleaning and Transformation of Data

The data is already clean and well arranged, so no record-level cleaning is required. Since we need only a subset of the columns, we remove the remaining ones with the code below.

cleanedata<-MainData %>% select(-c(country, from_date, to_date, url))

With the code below, we can check the data type of each column:

glimpse(cleanedata)
## Rows: 3,659,360
## Columns: 7
## $ epi_week <int> 188824, 188824, 188824, 188826, 188826, 188826, 188826, 18882…
## $ state    <chr> "PA", "PA", "PA", "PA", "PA", "PA", "MD", "MD", "LA", "LA", "…
## $ loc      <chr> "PHILADELPHIA", "PHILADELPHIA", "PHILADELPHIA", "PHILADELPHIA…
## $ loc_type <chr> "CITY", "CITY", "CITY", "CITY", "CITY", "CITY", "CITY", "CITY…
## $ disease  <chr> "TYPHOID FEVER [ENTERIC FEVER]", "SCARLET FEVER", "DIPHTHERIA…
## $ event    <chr> "DEATHS", "DEATHS", "DEATHS", "DEATHS", "DEATHS", "DEATHS", "…
## $ number   <int> 14, 4, 4, 12, 5, 7, 4, 1, 1, 4, 1, 9, 4, 4, 1, 2, 6, 1, 1, 5,…

As we can see, some columns need to be converted to a different class:

cleanedata$state<-as.factor(cleanedata$state)
cleanedata$disease<-as.factor(cleanedata$disease)
cleanedata$event<-as.factor(cleanedata$event)

To obtain a year column, we use the epi_week column. Its values have six digits: the first four give the year and the last two give the week number. The code below extracts the year:

cleanedata$year<-substr(cleanedata$epi_week, 1, nchar(cleanedata$epi_week) - 2)
cleanedata$year<-as.integer(cleanedata$year)
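The same split can also be done with integer arithmetic instead of string handling. The sketch below is an alternative under the assumption that every epi_week value has the six-digit YYYYWW form; the first two values appear in the output above, while the third is a hypothetical example.

```r
# Example epi_week values in YYYYWW form (the last one is hypothetical)
epi_week <- c(188824L, 188826L, 201352L)

year <- epi_week %/% 100L  # integer division drops the two week digits
week <- epi_week %% 100L   # the remainder keeps the two week digits

data.frame(epi_week, year, week)
```

This avoids the round trip through character strings and also yields the week number, should it be needed later.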

Finally, we have the dataset required for our analysis.

head(cleanedata)

We can check the unique values of each column with the “unique” function. This shows that we have 57 distinct state codes and 50 diseases to analyze.
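One way to get these counts for all columns at once is sketched below. The small data frame `toy` is a hypothetical stand-in for `cleanedata`, so the counts it produces are not those of the real dataset.

```r
# Toy stand-in for cleanedata; the values are hypothetical
toy <- data.frame(
  state   = c("PA", "PA", "MD", "LA"),
  disease = c("MEASLES", "SCARLET FEVER", "MEASLES", "MEASLES"),
  event   = c("CASES", "DEATHS", "CASES", "CASES")
)

# Number of distinct values in each column
distinct_counts <- sapply(toy, function(col) length(unique(col)))
distinct_counts
```

Applied to `cleanedata`, the same `sapply` call would report the number of distinct states, diseases and events in one step.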

5.3 Plotting graphs

5.3.1 The severity of each disease by determining the frequency of occurrence

Here we create heatmaps to analyse the occurrence of disease in every decade, with separate plots for “DEATHS” and “CASES”. To plot these, we add a “Decade” column derived from the year. First, we filter the data to 1891-2010, because we need whole decades.

heat <- cleanedata %>% filter(year >= 1891 & year <= 2010) #filtering data for 1891-2010
heat$Decade <- paste0((floor((heat$year - 1) / 10) * 10 + 1), "-", ((floor((heat$year - 1) / 10) + 1) * 10)) #creating decade column
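To check how the decade formula treats boundary years, it can be wrapped in a small helper and applied to a few sample years. The helper name `decade_label` is hypothetical and is used only for this check.

```r
# Same arithmetic as the Decade column above, wrapped in a hypothetical helper
decade_label <- function(year) {
  start <- floor((year - 1) / 10) * 10 + 1  # first year of the decade, e.g. 1891
  paste0(start, "-", start + 9)             # last year of the decade, e.g. 1900
}

labels <- decade_label(c(1891, 1900, 1901, 2010))
labels
```

Years ending in 0 fall into the decade they close, so 1900 is labelled "1891-1900" and 1901 starts "1901-1910".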

  1. In case of “DEATHS”:

Filter the data for the “DEATHS” event only:

heatDeath<-heat %>% filter(event=="DEATHS")

Group the data by state and decade:

heatDeath<-heatDeath %>% group_by(state, Decade) %>% summarize(DeathNumber=sum(number))

Now we have the aggregated data. Additionally, the chunk below bins the totals into intervals.

heatDeath$ScaleGroup <- cut(heatDeath$DeathNumber, breaks = c(0,10,100,1000,10000,100000,1000000, 2000000)) #Define scale group
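A possible refinement, shown here only as a sketch on hypothetical totals, is to attach the legend labels directly in `cut()` and to set `include.lowest = TRUE`. With the default settings a total of exactly 0 falls outside the first interval and becomes `NA`.

```r
# Hypothetical decade totals for illustration
totals <- c(0, 7, 10, 450, 120000, 1500000)

breaks <- c(0, 10, 100, 1000, 10000, 100000, 1000000, 2000000)
bin_labels <- c("0-10", "10-100", "100-1000", "1000-10000",
                "10000-100000", "100000-1000000", "1000000-2000000")

# include.lowest = TRUE keeps a total of exactly 0 in the first bin instead of NA
groups <- cut(totals, breaks = breaks, labels = bin_labels, include.lowest = TRUE)
as.character(groups)
```

Because the labels are bound to the factor levels, the legend text does not depend on which bins actually occur in the data.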

Finally, plotting our data for heatmap:

DeathPlot <- c("0-10", "10-100", "100-1000", "1000-10000", "10000-100000", "100000-1000000", "1000000-2000000") #define custom legend scale name
ggplot(heatDeath, aes(state, Decade)) +
  geom_tile(aes(fill = ScaleGroup)) +
  scale_fill_manual(values = c("#00ff00","#99ff33", "#33cc33", "#339933", "#336633", "#333300", "#009999"), labels=DeathPlot) +
  theme_hc() +
  theme(
    text = element_text(size = 8),
    axis.text.x = element_text(angle = 90, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(size = 16, hjust = 0.5, face = "bold"),
    axis.title = element_text(size = 12, margin = margin(b = 20)),
    axis.text = element_text(margin = margin(t = 10))
  ) +
  labs(
    title = "Heatmap of Deaths caused by diseases in States (1891-2010)",
    fill = "Scale Group",
    x="State",
    y="Decade"
  )

  2. In case of “CASES”:

Filter the data for the “CASES” event only:

heatCase<-heat %>% filter(event=="CASES")

Group the data by state and decade:

heatCase<-heatCase %>% group_by(state, Decade) %>% summarize(CaseNumber=sum(number))

Now we have the aggregated data. Additionally, the chunk below bins the totals into intervals.

heatCase$ScaleGroup <- cut(heatCase$CaseNumber, breaks = c(0,10,100,1000,10000,100000,1000000, 2000000)) #Define scale group

Finally, plotting our data for heatmap:

CasePlot <- c("0-10", "10-100", "100-1000", "1000-10000", "10000-100000", "100000-1000000", "1000000-2000000") #define custom legend scale name
ggplot(heatCase, aes(state, Decade)) +
  geom_tile(aes(fill = ScaleGroup)) +
  scale_fill_manual(values = c("#ffff00", "#ffcc33", "#ff9966", "#ff3399", "#ff0066", "#990066", "#660000"), labels=CasePlot) +
  theme_hc() +
  theme(
    text = element_text(size = 8),
    axis.text.x = element_text(angle = 90, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(size = 16, hjust = 0.5, face = "bold"),
    axis.title = element_text(size = 12, margin = margin(b = 20)),
    axis.text = element_text(margin = margin(t = 10))
  ) +
  labs(
    title = "Heatmap of Cases caused by diseases in States (1891-2010)",
    fill = "Scale Group",
    x="State",
    y="Decade"
  )

5.3.2 Morbidity rate of all diseases with respect to census 1890-2010

As defined in Section 4, the morbidity rate measures the portion of people in a specific geographical location who contracted a particular disease during a specific period of time. Morbidity refers to the state of being ill or unhealthy, and includes conditions of injury, disease, and disability.

In our case, we have 50 different diseases. To calculate the morbidity rate, we also require census data for 1890-2010, so we load the census population data of the USA for 1790-2020 (Source: https://en.wikipedia.org/wiki/United_States_census).

pop<-read.csv("pop-usa.csv", header=TRUE) #loading population data
head(pop)
pop<-pop%>% select(c(Year, Total.population)) #keeping necessary column
pop <-pop %>% filter(Year >= 1890 & Year <= 2010) #filtering data from 1890-2010
pop<-rename(pop, year=Year) #renaming column to match with main data set
pop$Total.population <- as.integer(gsub(",", "", pop$Total.population))

Next, we reshape the main dataset. We sum deaths and cases together, since we need the total number of reports per disease. First, we replace each year with the census year that closes its decade; for example, 1891, 1892, …, 1900 are all assigned to 1900. Then we group the data by year and disease.

mrate<-cleanedata
mrate <-mrate %>% filter(year >= 1891 & year <= 2010) #filtering data
mrate$year <- ceiling(mrate$year / 10) * 10 #making census decade column: 1891-1900 -> 1900, ..., 2001-2010 -> 2010

mrate<-mrate %>% group_by(year, disease) %>% summarize(Number=sum(number)) #grouping as per requirement
mrate$year<- as.integer(mrate$year)

Now we need to join our main data set with population data set with respect to common “year” column.

mrate<-left_join(mrate, pop, by="year")
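After a left join it can be worth checking that every year found a matching census row, since unmatched years would carry an `NA` population into the rate calculation. The sketch below shows the idea in base R on hypothetical toy tables; `base::merge(..., all.x = TRUE)` plays the role of `left_join` here.

```r
# Toy tables; all numbers are hypothetical
cases <- data.frame(year = c(1900L, 1910L, 1915L), Number = c(100, 200, 300))
pop   <- data.frame(year = c(1900L, 1910L), Total.population = c(1000, 2000))

# all.x = TRUE keeps every row of `cases`, like a left join.
# The base:: prefix avoids picking up a merge() from another attached package.
joined <- base::merge(cases, pop, by = "year", all.x = TRUE)

# Years with no matching census row end up with NA population
unmatched <- joined$year[is.na(joined$Total.population)]
unmatched
```

In this toy example only 1915 is unmatched, because it is not a census year.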

Now, we need to calculate Morbidity rate using population & number.

mrate<-mutate(mrate, MR=(Number/Total.population)*100)
mrate$MR <- round(mrate$MR, 3)
mrate <-mrate %>% filter(MR >= 0.01)

This gives the morbidity rate of every disease per decade. Some values are lower than 0.01, so for a more readable graph we keep only records with a morbidity rate of at least 0.01%. To create an animated racing bar chart, we need to rank the records within each year, which the code below does:

mrate<-mrate %>% group_by(year) %>% mutate(rank=as.integer(rank(MR)))
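One detail to watch is tied values: `rank()` gives ties an averaged rank by default, and truncating that to an integer can assign the same rank to two diseases, so their bars would be drawn on top of each other. The sketch below uses hypothetical rates and shows one possible alternative that yields distinct ranks with the largest rate first; note that it also reverses the ordering direction compared with the code above.

```r
# Hypothetical morbidity rates for one census year, including a tie
MR <- c(0.12, 0.50, 0.12, 0.03)

# Default averaged ranks truncated to integers: the two tied values collide
default_rank <- as.integer(rank(MR))

# Distinct ranks, largest rate first; ties are broken by order of appearance
distinct_rank <- rank(-MR, ties.method = "first")

data.frame(MR, default_rank, distinct_rank)
```

Whether ties actually occur after rounding the rates to three decimals depends on the data, so this is a precaution rather than a required change.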

Plotting animated chart:

staticplot <- ggplot(mrate, aes(rank, group = disease, fill=disease, color=disease)) +
  geom_tile(aes(y = MR/2,
                height = MR,
                width = 0.9), alpha = 0.8, color = NA)  +
  theme(legend.position="none") +
  geom_text(aes(y = 0, label = paste(disease, " ")), vjust = 0.2, hjust = 1) +
  geom_text(aes(y=MR,label = MR, hjust=-0.5), color="black", size=4) +
  coord_flip(clip = "off", expand = FALSE) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 5.5), breaks = seq(0.5, 5.5, 0.5)) +
  scale_x_reverse(labels = NULL) +
  guides(color="none", fill="none") +
  xlab("") +ylab(" Morbidity rate (%)") +
  labs(title = 'Morbidity rate of disease in USA',
       subtitle  =  "Year : {closest_state}") +
  theme_bw() +
  transition_states(year, transition_length = 2, state_length = 2) +
  ease_aes() +
  theme(
    plot.title = element_text(size = 16, hjust = 0, face = "bold"),
    plot.subtitle=element_text(size=14, hjust=0, face="bold", color="grey"),
    axis.ticks.y = element_blank(),
    axis.text.y = element_text(size = 10, color = "black"),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.margin = margin(2, 2, 0, 7, "cm"),
    plot.background = element_rect(fill = "white"),
    panel.background = element_rect(fill = "white"),
    panel.border = element_rect(color = "black", fill = NA),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    legend.background = element_blank(),
    legend.key = element_blank(),
    legend.text = element_text(size = 10),
    legend.title = element_blank()
  )

staticplot
animate(staticplot, fps = 5, width = 800, height = 538)

5.3.3 Analyze the top disease prevalence on a state-by-state basis

To analyse the three most prevalent diseases in each state, we plot a treemap showing their relative proportions.

To achieve this, we first group the data by state and disease.

StateData<-cleanedata %>% group_by(state, disease) %>% summarize(total_cases=sum(number))

To find the top diseases, we rank the diseases within each state and keep the three highest-ranked ones.

StateData <- StateData %>% group_by(state) %>% mutate(rank=as.integer(rank(desc(total_cases)))) #ranking the disease
StateData<-StateData %>% filter(rank<=3)

Plotting tree-map:

treemap(StateData, index = c("state","disease"),
              vSize = "total_cases", vColor = "disease", type = "index", bg.labels = 0,
              title = "All states with most affected three diseases", border.col = c("white"),
              border.lwds = c(0,0), palette = ("Set2"),
              fontcolor.labels = c("white", "black"),
              align.labels = list(c("left","top"),c("right","bottom"))) 

5.3.4 Top disease prevalence on a state-by-state basis

From the analysis above, we learn that measles, scarlet fever and influenza had a high impact on the health history of the USA. The country faced epidemics caused by these diseases, which are infamous for their outbreaks in US history.

So, here we focus on these three diseases. First, we build a choropleth map for each disease to get a clear idea of how it affected the individual states.

For this, we use the Plotly library, an open-source graphing library that provides interactive charts and maps for R. In these interactive maps, we can select any year using the slider below the map and see the numbers for a particular state by hovering the cursor over it. (Note: States shown in full white have no reported case/death data for that year, so hovering shows nothing for them.)

  1. Cases/Deaths of Measles in all states between 1906-2001

First, we filter the main data for “MEASLES” and group it by year and state. Note that the dataset has no “DEATHS” records for measles, so the whole plot shows only the “CASES” event.

DM<-cleanedata %>% filter(disease=="MEASLES") #filter MEASLES
DM<-DM %>% group_by(year, state) %>% summarize(TotalNumber=sum(number))

To display full state names, we need external data, as the main data contains only state codes (Source: https://www.kaggle.com/datasets/justinrwong/us-states-to-abbreviations).

code<-read.csv("table-usa.csv", header=TRUE) #loading State name data
DM<-left_join(DM, code, by="state") #joining code data with main data

Plotting final map:

labelstyle1<-list(bgcolor="#dd571c", bordercolor="transparent", font=list(color="White"))
DM$label <- with(DM, paste(State, '<br>', "Total Cases", TotalNumber))
mfig <- plot_geo(DM, locationmode='USA-states', frame = ~year) %>%
  add_trace(locations=~state, z=~TotalNumber, text=~label, hoverinfo='text', zmin=0, zmax=195000, color=~TotalNumber, colors='Reds') %>%
  layout(geo=list(scope='usa'), title= list(text = 'Cases of Measles in USA States during 1906-2001', font = list(color = "black", size = 16, weight = "bold"))) %>%
  style(hoverlabel=labelstyle1) %>%
  config(displayModeBar=FALSE)
mfig

  2. Cases/Deaths of Scarlet Fever in all states between 1888-1966

Similarly, we filter the main data for “SCARLET FEVER” and group it by year, state and event. Scarlet fever has both deaths and cases, so we spread the values of the event column into two new columns.

DS<-cleanedata %>% filter(disease=="SCARLET FEVER") #filter SCARLET FEVER
DS<-DS %>% group_by(year, state, event) %>% summarize(TotalNumber=sum(number))
DS<- DS %>% pivot_wider(names_from=event, values_from=TotalNumber)
DS$DEATHS[is.na(DS$DEATHS)]<-0 #replacing NA values
DS$CASES[is.na(DS$CASES)]<-0
DS<-mutate(DS, TotalNumber=DEATHS+CASES) #keeping TotalNumber

To display full state names, we need external data, as the main data contains only state codes (Source: https://www.kaggle.com/datasets/justinrwong/us-states-to-abbreviations).

code<-read.csv("table-usa.csv", header=TRUE) #loading State name data
DS<-left_join(DS, code, by="state") #joining code data with main data

Plotting final map:

labelstyle2<-list(bgcolor="#234f1e", bordercolor="transparent", font=list(color="White"))
DS$label <- with(DS, paste(State, '<br>', "Total Number", TotalNumber, '<br>', "Deaths", DEATHS, '<br>', "Cases", CASES))
sfig <- plot_geo(DS, locationmode='USA-states', frame = ~year) %>%
  add_trace(locations=~state, z=~TotalNumber, text=~label, hoverinfo='text', zmin=0, zmax=67000, color=~TotalNumber, colors='Greens') %>%
  layout(geo=list(scope='usa'),title= list(text = 'Cases and Deaths of Scarlet Fever in USA States during 1888-1966', font = list(color = "black", size = 16, weight = "bold"))) %>%
  style(hoverlabel=labelstyle2) %>%
  config(displayModeBar=FALSE)
sfig

  3. Cases/Deaths of Influenza in all states between 1919-1951

Similarly, we filter the main data for “INFLUENZA” and group it by year, state and event. Influenza has both deaths and cases, so we spread the values of the event column into two new columns.

DI<-cleanedata %>% filter(disease=="INFLUENZA") #filter INFLUENZA
DI<-DI %>% group_by(year, state, event) %>% summarize(TotalNumber=sum(number))
DI<-DI %>% pivot_wider(names_from=event, values_from=TotalNumber)
DI$DEATHS[is.na(DI$DEATHS)]<-0 #replacing NA values
DI$CASES[is.na(DI$CASES)]<-0
DI<-mutate(DI, TotalNumber=DEATHS+CASES) #keeping TotalNumber

To display full state names, we need external data, as the main data contains only state codes (Source: https://www.kaggle.com/datasets/justinrwong/us-states-to-abbreviations).

code<-read.csv("table-usa.csv", header=TRUE) #loading State name data
DI<-left_join(DI, code, by="state") #joining code data with main data

Plotting final map:

labelstyle3<-list(bgcolor="#ef477d", bordercolor="transparent", font=list(color="White"))
DI$label <- with(DI, paste(State, '<br>', "Total Number", TotalNumber, '<br>', "Deaths", DEATHS, '<br>', "Cases", CASES))
ifig <- plot_geo(DI, locationmode='USA-states', frame = ~year) %>%
  add_trace(locations=~state, z=~TotalNumber, text=~label, hoverinfo='text', zmin=0, zmax=67000, color=~TotalNumber, colors='RdPu') %>%
  layout(geo=list(scope='usa'),title= list(text = 'Cases and Deaths of Influenza in USA States during 1919-1951', font = list(color = "black", size = 16, weight = "bold"))) %>%
  style(hoverlabel=labelstyle3) %>%
  config(displayModeBar=FALSE)
ifig

5.3.5 Prevalence of these three diseases across the USA

To see how the number of reports of these diseases changed across the USA over time, we use a line graph.

First, we group the data by year and disease:

data_usa<-cleanedata %>% group_by(year, disease) %>% summarize(TotalNumber=sum(number))

To keep only the three diseases, we use the subset function:

dsc<-c("MEASLES", "SCARLET FEVER", "INFLUENZA") #creating list of name of diseases
data_usa<-subset(data_usa, disease %in% dsc) #subset the main data with list

Plotting graph:

xscale <- seq(1885, 2015, by = 10) #x-axis breaks, one per decade

ggplot(data_usa, aes(x = year, y = TotalNumber, color = disease)) +
  geom_line(size = 1) +
  scale_x_continuous(breaks = xscale) +
  scale_y_continuous(labels = scales::comma, expand = c(0, 0), limits = c(0, 1100000), breaks = seq(50000, 1100000, 50000)) +
  labs(title = "Casualties caused by Most affected Disease in USA",
       x = "Affected Year",
       y = "Number of Total Cases",
       color = "Name of Disease") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5, face = "bold"),
    axis.ticks.y = element_blank(),
    axis.text.y = element_text(size = 10, color = "black"),
    axis.text.x = element_text(size = 10, color = "black", angle = 45, hjust = 1),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.background = element_rect(fill = "white"),
    panel.background = element_rect(fill = "white"),
    panel.border = element_rect(color = "black", fill = NA),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    legend.background = element_blank(),
    legend.key = element_blank(),
    legend.text = element_text(size = 10),
    legend.title = element_blank()
  )

5.3.6 Distribution of Deaths & Cases of the most prevalent diseases in different regions of USA

To depict the distribution of deaths and cases, we use a box plot.

To keep only the three diseases, we use the subset function:

dsc<-c("MEASLES", "SCARLET FEVER", "INFLUENZA") #creating list of name of diseases
dist<-subset(cleanedata, disease %in% dsc) #subset the main data with list

Next, we load the USA state-region data (Source: https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf) and join it with the filtered data.

usregion<-read.csv("USregion.csv") #loading Region data
usregion<-rename(usregion, state=State.Code)
dist<-right_join(dist, usregion, by="state") #joining region data with the three-disease subset
dist<-dist %>% group_by(Region, event) %>% summarize(TotalNumber=sum(number)) #grouping data with regions

Plotting graph:

ggplot(dist, aes(x = TotalNumber, y = Region, fill = Region)) +
    geom_boxplot() +
    scale_x_continuous(trans = "log2", labels = comma, breaks = trans_breaks("log2", function(x) 2^x)) +
    scale_fill_brewer(palette = "Dark2") + 
    theme_minimal() +
    theme(legend.position = "none",
          text = element_text(size = 11),
          plot.title = element_text(size = 16, hjust = 0.5, face = "bold"),
          axis.text.y = element_text(size = 10, color = "black"),
          axis.text.x = element_text(size = 10, color = "black", angle = 45, hjust = 1),
          axis.title = element_text(size = 12, face = "bold"),
          panel.grid.major = element_line(color = "gray80"),
          panel.grid.minor = element_blank()) +
    labs(title = "Deaths and Cases Distribution with respect to Region of USA",
         x = "Total Number",
         y = "Region")

6 Results

Our aim was to identify the three diseases that affected the US states the most from 1888 to 2014. To achieve this, we analysed different aspects of the dataset and made several interesting discoveries along the way. We explored the data, built appropriate graphs to show the important findings, and obtained a clear picture of the three diseases that were most common during those years.

7 Discussion

This analysis covers disease prevalence in the United States from 1888 to 2014, categorized by state. By examining the prevalence of diseases across different states, we gain a comprehensive understanding of the impact and distribution of various health conditions in the country. The data reveals compelling trends and patterns in disease prevalence at the state level, allowing us to identify states that experienced higher rates of specific diseases during specific time periods. Additionally, we obtained an idea of the most prevalent diseases in the country.

However, it is essential to acknowledge that comprehending raw data can be challenging. While plots are valuable tools for analysis, they must be carefully constructed to provide meaningful insights to readers. Precision and specificity in producing plots and graphs are crucial to ensure that they communicate the underlying data effectively and enable readers to derive valuable interpretations. Although graphical representation simplifies the understanding of complex data, it is important to strike a balance between accessible visualization and accuracy and integrity by presenting well-designed and targeted plots.

8 Conclusion

In conclusion, this study shows how data visualization can greatly improve data analysis and communication in research. By representing information visually, researchers can understand and convey complex data more easily and effectively, which in turn helps others understand and engage with the research findings.

9 References

  1. Project Tycho: Home https://www.tycho.pitt.edu/
  2. U.S. Census Bureau, 2020 Censuses of Population and the population estimate program. https://data.ers.usda.gov/reports.aspx?ID=17827
  3. National Notifiable Diseases Surveillance System (NNDSS) https://www.cdc.gov/nndss/about/index.html
  4. Health in the United States (Wikipedia) https://en.wikipedia.org/wiki/Health_in_the_United_States
  5. Basic Statistics: About Incidence, Prevalence, Morbidity, and Mortality - Statistics Teaching Tools https://www.health.ny.gov/diseases/chronic/basicstat.htm
  6. United States census (Wikipedia) https://en.wikipedia.org/wiki/United_States_census
  7. The Worst Outbreaks in U.S. History https://www.healthline.com/health/worst-disease-outbreaks-history
  8. US States to Abbreviations https://www.kaggle.com/datasets/justinrwong/us-states-to-abbreviations
  9. gganimate: Getting Started https://gganimate.com/articles/gganimate.html
  10. Plotly r graphing library in R https://plotly.com/r/
  11. Morbidity Rate https://corporatefinanceinstitute.com/resources/wealth-management/morbidity-rate/
  12. US census bureau regions and divisions https://github.com/cphalpert/census-regions/blob/master/us%20census%20bureau%20regions%20and%20divisions.csv
  13. R Markdown https://rmarkdown.rstudio.com/index.html